In 2015, studies were conducted in a mouse model of how Down syndrome affects changes in the levels of various proteins.
We have uploaded a dataset for analysis from the link
Analysis of the dataset showed that the experiment involved 72 mice.
For each mouse, it was planned to carry out 15 measurements of its state, unfortunately, not all of them were performed in full. There are 1080 measurments in dataset but only 552 are full (without missing values).
We can distinguish 8 different classes of mice.
The description of these classes can be found here:
Classes:
c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)
t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)
For each class in dataset there are n measurments:
| class | n |
|---|---|
| c-CS-m | 150 |
| c-CS-s | 135 |
| c-SC-m | 150 |
| c-SC-s | 135 |
| t-CS-m | 135 |
| t-CS-s | 105 |
| t-SC-m | 135 |
| t-SC-s | 135 |
We have built boxplots to visually represent differences in BDNF_N protein expression in different classes.
Next, we applied one-way ANOVA to pinpoint the existence of differences.
However, first we checked the limits of its applicability.
We used Bartlett test to check dispersion homogentity and found dispersion heteroskedasticity with p_value = 1.381709410^{-10}.
However given that the number of observations in groups is close to the same, we can apply one-way ANOVA and Tukey’s test in order to establish differences in groups
As we can see p-value is less then 0.05 so there are significant differences in mean of groups. We applied the Tukey criterion in order to find out between which groups there are differences.
The figure shows the differences between the group averages (Differences in mean levels of class) and their confidence intervals, calculated taking into account the control over the group error probability (95% family-wise confidence level). In 13 cases, the confidence intervals include 0, indicating no difference between the respective groups. We can conclude that there are significant differences between the other groups.
We built a multiple linear model, in which ERBB4_N protein level acted as a dependent variable, and all other protein levels as independent predictors
##
## Call:
## lm(formula = ERBB4_N ~ . - pS6_N, data = df_protein_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.023037 -0.003370 -0.000127 0.003071 0.031955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.566e-02 8.245e-03 3.112 0.001970 **
## DYRK1A_N -6.824e-03 7.475e-03 -0.913 0.361710
## ITSN1_N 8.127e-03 1.058e-02 0.768 0.442776
## BDNF_N 3.833e-02 1.919e-02 1.997 0.046378 *
## NR1_N -1.409e-02 6.685e-03 -2.108 0.035544 *
## NR2A_N -1.049e-04 1.361e-03 -0.077 0.938610
## pAKT_N 7.961e-02 3.044e-02 2.615 0.009203 **
## pBRAF_N -4.103e-02 3.065e-02 -1.339 0.181356
## pCAMKII_N -1.451e-05 7.954e-04 -0.018 0.985451
## pCREB_N -6.015e-02 2.910e-02 -2.067 0.039290 *
## pELK_N 4.731e-05 1.030e-03 0.046 0.963391
## pERK_N -1.987e-02 7.887e-03 -2.519 0.012109 *
## pJNK_N -6.035e-02 2.538e-02 -2.378 0.017814 *
## PKCA_N 8.377e-03 2.736e-02 0.306 0.759625
## pMEK_N 8.498e-03 2.706e-02 0.314 0.753676
## pNR1_N -2.566e-02 1.334e-02 -1.924 0.054968 .
## pNR2A_N 1.004e-02 7.239e-03 1.388 0.165927
## pNR2B_N 1.928e-02 6.659e-03 2.896 0.003959 **
## pPKCAB_N 5.727e-03 2.773e-03 2.065 0.039457 *
## pRSK_N 1.001e-02 1.427e-02 0.701 0.483573
## AKT_N -6.738e-04 8.183e-03 -0.082 0.934413
## BRAF_N 1.550e-02 1.213e-02 1.278 0.201926
## CAMKII_N -3.429e-03 2.011e-02 -0.170 0.864700
## CREB_N -1.450e-02 3.091e-02 -0.469 0.639135
## ELK_N 3.510e-03 3.719e-03 0.944 0.345811
## ERK_N 4.660e-03 2.804e-03 1.662 0.097208 .
## GSK3B_N -5.840e-03 6.824e-03 -0.856 0.392588
## JNK_N -1.188e-02 3.378e-02 -0.352 0.725317
## MEK_N 8.785e-03 2.197e-02 0.400 0.689510
## TRKA_N 5.722e-03 1.646e-02 0.348 0.728257
## RSK_N -2.299e-02 3.915e-02 -0.587 0.557214
## APP_N -5.345e-03 1.825e-02 -0.293 0.769751
## Bcatenin_N 1.107e-03 5.298e-03 0.209 0.834546
## SOD1_N -6.241e-03 3.985e-03 -1.566 0.118018
## MTOR_N 4.161e-02 1.651e-02 2.520 0.012048 *
## P38_N -1.619e-02 1.348e-02 -1.201 0.230267
## pMTOR_N -9.530e-03 7.667e-03 -1.243 0.214475
## DSCR1_N -6.849e-03 1.023e-02 -0.670 0.503374
## AMPKA_N 2.522e-02 2.294e-02 1.099 0.272263
## NR2B_N 7.391e-03 9.146e-03 0.808 0.419449
## pNUMB_N -6.467e-03 2.006e-02 -0.322 0.747326
## RAPTOR_N 4.750e-02 2.465e-02 1.927 0.054629 .
## TIAM1_N -3.284e-02 1.956e-02 -1.679 0.093892 .
## pP70S6_N 2.739e-03 5.307e-03 0.516 0.606022
## NUMB_N -3.304e-02 3.784e-02 -0.873 0.383042
## P70S6_N -4.357e-03 5.119e-03 -0.851 0.395118
## pGSK3B_N 1.291e-01 4.063e-02 3.179 0.001577 **
## pPKCG_N -7.636e-03 1.962e-03 -3.893 0.000113 ***
## CDK5_N 7.111e-04 9.807e-03 0.073 0.942225
## S6_N 1.566e-02 7.647e-03 2.049 0.041059 *
## ADARB1_N -2.609e-03 2.040e-03 -1.279 0.201557
## AcetylH3K9_N 2.113e-03 1.103e-02 0.192 0.848199
## RRP1_N -1.488e-02 1.038e-02 -1.434 0.152232
## BAX_N -1.304e-01 3.542e-02 -3.680 0.000260 ***
## ARC_N 1.687e-01 6.748e-02 2.499 0.012775 *
## nNOS_N -4.749e-03 2.854e-02 -0.166 0.867918
## Tau_N 7.082e-02 1.810e-02 3.913 0.000104 ***
## GFAP_N -4.703e-02 5.400e-02 -0.871 0.384291
## GluR3_N -7.133e-03 2.436e-02 -0.293 0.769812
## GluR4_N -7.030e-02 3.373e-02 -2.084 0.037672 *
## IL1B_N 2.845e-02 1.081e-02 2.631 0.008794 **
## P3525_N 4.994e-02 2.423e-02 2.061 0.039810 *
## pCASP9_N 8.118e-03 3.044e-03 2.667 0.007923 **
## PSD95_N 2.379e-02 3.811e-03 6.242 9.55e-10 ***
## SNCA_N -1.137e-02 3.484e-02 -0.326 0.744195
## Ubiquitin_N 2.330e-03 4.883e-03 0.477 0.633468
## pGSK3B_Tyr216_N 2.808e-02 1.006e-02 2.793 0.005439 **
## SHH_N 4.904e-02 1.998e-02 2.454 0.014471 *
## BAD_N -3.592e-02 3.508e-02 -1.024 0.306311
## BCL2_N 6.173e-03 3.112e-02 0.198 0.842840
## pCFOS_N -1.232e-02 3.111e-02 -0.396 0.692240
## SYP_N 1.354e-02 1.079e-02 1.254 0.210395
## H3AcK18_N 2.401e-03 2.527e-02 0.095 0.924349
## EGR1_N -6.016e-03 2.619e-02 -0.230 0.818406
## H3MeK4_N 1.227e-02 2.154e-02 0.570 0.569097
## CaNA_N -1.749e-03 3.419e-03 -0.512 0.609159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005885 on 476 degrees of freedom
## Multiple R-squared: 0.8641, Adjusted R-squared: 0.8427
## F-statistic: 40.37 on 75 and 476 DF, p-value: < 2.2e-16
The first graph shows that the residues are unevenly distributed. We can see dispersion heteroskedasticity. Also here we can notice nonlinearity of relationship as not all observations are in the +/- 2 standard deviation zone. In addition, the residuals are unevenly distributed, we conclude that the variables are not independent.
Cook’s distance graph shows that there is no influential observations.
Distribution graf shows that residual’s distribution is slightly differs from normal .
To assess the presence of multicollinearity in our model, we built a correlation matrix, but visualized only significant interactions (significant level = 0.8).
As we can see, there are quite a few interactions between dependent variables.
Based on all the diagnostics of the model, we can conclude that the use of a linear model in this case is not an advantageous solution.
We can try to optimize this model.
After removing 12 predictors from the model, with the largest VIF, multicollinearity is still observed in the model.
## DYRK1A_N BDNF_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N pELK_N
## 18.761162 15.191238 20.760271 19.833740 9.486633 12.983859 13.242561 2.250339
## pJNK_N pMEK_N pNR1_N pNR2A_N pPKCAB_N pRSK_N AKT_N BRAF_N
## 22.297602 21.584814 23.373532 21.317910 19.454005 13.473465 14.546179 17.881495
Only 16 VIFs are showm.
Thus, it makes no sense to further try to optimize the model. Better to use another method.
We performed principal component analisys of mice dataframe to reduce a number of dimensions.
The graph shows the contribution of the components to the total variability. For further analysis, we will leave only the components whose contribution is greater than in Broken Stick Model.
After that, we built the ordination in the axes of the first two principal components
Unfortunately, grouping each class separately does not give clear visualization. Therefore, I tried to build an ordination with division into two types of mice (stimulated / non stimulated).
We can see that now the groups are more pronounced for stimulated and non stimulated mice. However there are almost no difference between Control and Down syndrom mice and stimulation with both saline and memantine affects them equally.
We can conclude that there are a lot of proteins correlated with each other. Probably we should remove some of them from analisys.
As we see see points do not form any logical clusters.
We can try to plot graphs by grouping points in a different way.
We tried to plot graphs for trisomy / non-trisomy groups and stimulated / non-stimulated mice.
Only on the graph showing the groups of stimulated / not stimulated mice, we can see some differences in the groups. (As at 2D PCA plot)
Since the contribution of the last components is too small, only the first 15 are displayed
| PC_N | percent |
|---|---|
| PC1 | 29.9334664 |
| PC2 | 17.3490002 |
| PC3 | 9.8525452 |
| PC4 | 9.3007018 |
| PC5 | 4.2628791 |
| PC6 | 3.9510904 |
| PC7 | 3.0744486 |
| PC8 | 2.9289286 |
| PC9 | 2.3776033 |
| PC10 | 1.6606732 |
| PC11 | 1.4437148 |
| PC12 | 1.2767892 |
| PC13 | 1.1207064 |
| PC14 | 0.9102884 |
| PC15 | 0.7937266 |
We’ve count differences using DESeq2 for all possible conditions (together and separatly).
## log2 fold change (MLE): condition t.SC.s vs c.CS.m
## Wald test p-value: condition t.SC.s vs c.CS.m
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## SOD1_N 52.1509 1.167145 0.0640616 18.21910 3.64066e-74 2.80331e-72
## pERK_N 52.2565 -0.891397 0.0600636 -14.84089 7.96959e-50 3.06829e-48
## BRAF_N 34.9634 -0.671304 0.0540479 -12.42052 2.02242e-35 5.19088e-34
## GSK3B_N 116.9952 -0.316110 0.0266573 -11.85827 1.94957e-32 3.75292e-31
## DYRK1A_N 40.9928 -0.512317 0.0577757 -8.86734 7.49153e-19 1.15370e-17
## pPKCAB_N 160.0989 -0.386464 0.0485008 -7.96819 1.61018e-15 2.06640e-14
Here we can see DESeq2 calculated list of differrentially expressed proteins sorted by p_value (adjusted) taking into account all the classes at the same time.
Also we’ve prepared a gene list includes only reliably altered proteins. (it can be used in subsequent research )
In the table we can see the most changede genes per each condition. They can be used for Genquery or MSigDB.
| Trisomy/Control | Saline/Memantine | Stimulated/Non_stimulated | All |
|---|---|---|---|
| Tau_N | pCAMKII_N | SOD1_N | SOD1_N |
| ITSN1_N | BRAF_N | pERK_N | pERK_N |
| S6_N | ERK_N | CaNA_N | BRAF_N |
| AcetylH3K9_N | DYRK1A_N | GSK3B_N | GSK3B_N |
| APP_N | pERK_N | pPKCAB_N | DYRK1A_N |
| DYRK1A_N | NUMB_N | DYRK1A_N | pPKCAB_N |
| pNR1_N | ELK_N | BRAF_N | NR1_N |
| P38_N | AKT_N | ITSN1_N | Tau_N |
| MTOR_N | CDK5_N | Ubiquitin_N | CaNA_N |
| BRAF_N | SYP_N | P38_N | ITSN1_N |
| GluR3_N | P38_N | pMTOR_N | pNR1_N |
| pNR2A_N | pMTOR_N | AKT_N | Ubiquitin_N |
| pMTOR_N | DSCR1_N | S6_N | pELK_N |
| NR1_N | ITSN1_N | IL1B_N | NR2A_N |
| NR2B_N | IL1B_N | pCAMKII_N | pNR2B_N |
| AMPKA_N | CaNA_N | pNR2A_N | H3MeK4_N |
| pNR2B_N | H3AcK18_N | NR2B_N | ADARB1_N |
| H3AcK18_N | pPKCAB_N | pNUMB_N | H3AcK18_N |
| pPKCG_N | ADARB1_N | PKCA_N | GluR3_N |
| pERK_N | SOD1_N | H3MeK4_N | EGR1_N |
| PSD95_N | pGSK3B_N | ADARB1_N | TRKA_N |
| NR2A_N | NR1_N | EGR1_N | TIAM1_N |
| AKT_N | BAX_N | PSD95_N | AcetylH3K9_N |
| IL1B_N | Ubiquitin_N | MTOR_N | BCL2_N |
| pCASP9_N | pNR2A_N | pAKT_N | BAD_N |
| SYP_N | pP70S6_N | SNCA_N | pP70S6_N |
| pP70S6_N | Bcatenin_N | pMEK_N | pCASP9_N |
| EGR1_N | PSD95_N | NUMB_N | RRP1_N |
| pCFOS_N | pRSK_N | ERK_N | PKCA_N |
| SNCA_N | NR2B_N | pGSK3B_Tyr216_N | pCREB_N |
We visualized the differences in expression in several proteins with the lowest p adjusted to estimate the effect of different influence at mice.